How to Construct Perfect and Worse-than-Coin-Flip Spoofing Countermeasures: A Word of Warning on Shortcut Learning
Shortcut learning, or the 'Clever Hans effect', refers to situations where a
learning agent (e.g., a deep neural network) learns spurious correlations
present in the data, resulting in biased models. We focus on finding shortcuts
in deep-learning-based spoofing countermeasures (CMs) that predict whether a
given utterance is spoofed or not. While prior work has addressed specific data
artifacts, such as silence, no general normative framework has been explored
for analyzing shortcut learning in CMs. In this study, we propose a generic
approach to identifying shortcuts by introducing systematic interventions on
the training and test sides, including the boundary cases of 'near-perfect' and
'worse-than-coin-flip' (label-flip) performance. Using three different models,
ranging from classic to state-of-the-art, we demonstrate the presence of
shortcut learning in five simulated conditions. We analyze the results using a
regression model to understand how biases affect the class-conditional score
statistics.
Comment: Interspeech 202
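The boundary cases can be illustrated with a toy sketch (all names and numbers below are hypothetical, not from the paper): a countermeasure that latches onto a spurious leading-silence artifact scores near-perfectly when the artifact-label association at test time matches training, and worse than a coin flip when the association is flipped.

```python
import random

random.seed(0)

def make_trials(n, artifact_on_spoof=True):
    # Each trial: (leading_silence_ms, label), label 1 = spoofed.
    # The silence artifact is spuriously correlated with the label.
    trials = []
    for _ in range(n):
        label = random.randint(0, 1)
        has_artifact = (label == 1) if artifact_on_spoof else (label == 0)
        silence = random.gauss(300, 30) if has_artifact else random.gauss(100, 30)
        trials.append((silence, label))
    return trials

# "Train": a shortcut model that thresholds on the artifact alone.
train = make_trials(2000, artifact_on_spoof=True)
threshold = sum(s for s, _ in train) / len(train)
predict = lambda silence: int(silence > threshold)

def accuracy(trials):
    return sum(predict(s) == y for s, y in trials) / len(trials)

# Matched association: near-perfect. Flipped association: worse than coin flip.
matched = accuracy(make_trials(2000, artifact_on_spoof=True))
flipped = accuracy(make_trials(2000, artifact_on_spoof=False))
print(f"matched: {matched:.2f}, flipped: {flipped:.2f}")
```

The point of the intervention is that both extremes come from the same model; only the artifact-label association in the test data changes.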
Voice Mimicry Attacks Assisted by Automatic Speaker Verification
In this work, we simulate a scenario where a publicly available ASV system is used to enhance mimicry attacks against another, closed-source ASV system. Specifically, ASV technology is used to perform a similarity search between the voices of recruited attackers (6) and potential target speakers (7,365) from the VoxCeleb corpora to find the closest targets for each attacker. In addition, we consider 'median', 'furthest', and 'common' targets to serve as reference points. Our goal is to gain insight into how well similarity rankings transfer from the attacker's ASV system to the attacked ASV system, whether the attackers are able to improve their attacks by mimicking, and how the properties of the attackers' voices change due to mimicking. We address these questions through ASV experiments, listening tests, and prosodic and formant analyses. For the ASV experiments, we use i-vector technology on the attacker side and x-vectors on the attacked side. For the listening tests, we recruit listeners through crowdsourcing. The results of the ASV experiments indicate that the speaker similarity scores transfer well from one ASV system to another. Both the ASV experiments and the listening tests reveal that the mimicry attempts do not, in general, help in bringing the attacker's scores closer to the target's. A detailed analysis shows that mimicking does not improve attacks when the natural voices of the attackers and targets are similar to each other. The analysis of prosody and formants suggests that the attackers were able to change their speaking rates considerably when mimicking, but the changes in F0 and formants were modest. Overall, the results suggest that untrained impersonators do not pose a high threat to ASV systems, but the use of ASV systems to attack other ASV systems is a potential threat.
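The ASV-assisted target selection amounts to a nearest-neighbour search over speaker embeddings. The sketch below uses random vectors as stand-ins for i-vector embeddings and invented speaker IDs; it only illustrates ranking a target pool by cosine similarity to pick 'closest', 'median', and 'furthest' targets for one attacker.

```python
import math
import random

random.seed(1)

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical stand-ins for speaker embeddings (e.g., i-vectors).
dim = 32
attacker = [random.gauss(0, 1) for _ in range(dim)]
targets = {f"spk{i:04d}": [random.gauss(0, 1) for _ in range(dim)]
           for i in range(500)}

# Rank the whole target pool by similarity to the attacker's voice.
ranked = sorted(targets, key=lambda s: cosine(attacker, targets[s]),
                reverse=True)
closest, median, furthest = ranked[0], ranked[len(ranked) // 2], ranked[-1]
print(closest, median, furthest)
```

In the study itself, the open question is then whether this ranking, computed with the attacker's ASV system, carries over to the independently built system under attack.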
Can We Use Speaker Recognition Technology to Attack Itself? Enhancing Mimicry Attacks Using Automatic Target Speaker Selection
We consider technology-assisted mimicry attacks in the context of automatic speaker verification (ASV). We use ASV itself to select target speakers to be attacked by human-based mimicry. We recorded 6 naive mimics, for whom we select target celebrities from the VoxCeleb1 and VoxCeleb2 corpora (7,365 potential targets) using an i-vector system. The attacker attempts to mimic the selected target, with the utterances subjected to ASV tests using an independently developed x-vector system. Our main finding is negative: even if some of the attacker scores against the target speakers were slightly increased, our mimics did not succeed in spoofing the x-vector system. Interestingly, however, the relative orderings of the selected targets (closest, furthest, median) are consistent between the systems, which suggests some level of transferability between the systems.
Comment: A slightly shorter version has been submitted to IEEE ICASSP 2019.
Gamified Speaker Comparison by Listening
We address speaker comparison by listening in a game-like environment,
hypothesized to make the task more motivating for naive listeners. We present
the same 30 trials selected with the help of an x-vector speaker recognition
system from VoxCeleb to a total of 150 crowdworkers recruited through Amazon's
Mechanical Turk. They are divided into cohorts of 50, each using one of three
alternative interface designs: (i) a traditional (non-gamified) design; (ii) a
gamified design with feedback on decisions, along with points, game-level
indications, and the possibility of interface customization; (iii) another
gamified design with the additional constraint of a maximum of 5 'lives'
consumed by wrong answers. We analyze the impact of these interface designs on
listener error rates (both misses and false alarms), probability calibration,
and time of quitting, along with a survey questionnaire. The results indicate
improved performance from (i) to (ii) and (iii), particularly in terms of
balancing the two types of detection errors.
Comment: Accepted to Odyssey 2022 The Speaker and Language Recognition
Workshop